Abstract
In 2021, an average of 2,400 companies per month were declared bankrupt in Taiwan. Corporate bankruptcy carries major stakes, particularly during the Covid-19 crisis, when many companies around the world struggled to carry on their activities. Several countries have also seen a rise in state-guaranteed loans, intended to contain the effects of the crisis on corporate financing. Finding a suitable methodology for predicting corporate bankruptcies is therefore crucial for estimating each company's needs in terms of aid and financing.
Several studies show that a company's bankruptcy results from financial difficulties that can be measured by a broad set of indicators. Our study therefore builds several predictive models based on company financial indicators. We then evaluate the predictive accuracy of each model and compare the performance of the best specifications.
The data come from the Taiwan Economic Journal. They cover the period 1999-2009 and concern Taiwanese companies. The dataset contains 6,819 observations and 95 explanatory variables in addition to the bankruptcy indicator, the unit of observation being the company.
Corporate bankruptcy prediction has already been studied many times. The articles on the subject use a variety of machine learning methods. Our project compares several classification models and tries to add approaches that have not been covered so far. We also draw on existing models while trying to extend them. In addition, we try to approach variable selection and preprocessing differently.
Our study is organized in five parts.
Studies on corporate bankruptcy prediction are numerous and date back to the 1930s. The pioneers of this question published a study in a Bureau of Business Research report (1930) examining the determinants of business failure. The sample comprised 29 industrial firms; the study compared the values of 24 financial ratios with the sample average and drew conclusions about the characteristics shared by failing firms. This univariate evaluation method is now obsolete and subject to a number of biases.
Other, multivariate studies published in the 1960s and 1970s mainly used discriminant analysis. In the 1980s and 1990s, logit analysis and neural networks were the predominant methods (see Bellovary (2007) for a historical summary of bankruptcy prediction).
More recent studies, which use other statistical methods as well as larger samples, are more relevant to our problem. Wu (2010) uses data on companies listed on the New York Stock Exchange and the American Stock Exchange (AMEX) over the period 1980 to 2006. Five models are compared (multiple discriminant analysis, logit, probit, hazard model and Black-Scholes model), alongside a multi-period logit model. The authors conclude that the hazard model borrowed from Shumway (2001) outperforms the other models in predictive performance. The explanatory variables yielding the most reliable bankruptcy forecasts relate to accounting information, market data and firm characteristics.
Pervan et al. (2011) use data on 156 Croatian companies for January-June 2010, with financial ratios as explanatory variables. The companies studied are fairly heterogeneous: the selected sectors include firms in manufacturing and wholesale trade. Healthy and bankrupt companies are equal in number and randomly selected. The authors use two models, logistic regression and discriminant analysis. Discriminant analysis is the less accurate of the two at predicting bankrupt companies, at 79.5% against 85.9% for logistic regression.
Mu-Yen Chen (2011) uses a database of 200 companies listed on the Taiwan Stock Exchange Corporation. The author compares 9 models, using methods based on decision tree classification, neural networks and evolutionary computation techniques. The best-performing model is PSO-SVM (Particle Swarm Optimization-Support Vector Machine). A principal component analysis was used to select the relevant variables among 42 ratios: 33 financial, 8 non-financial and 1 macroeconomic index. Only 8 of the 42 ratios were retained after the principal component analysis, all of them financial. The author therefore concludes that financial ratios have a greater effect on the performance of financial prediction than non-financial ratios and macroeconomic indices.
Finally, Deron Liang et al. (2016) use the same database as our study. To select variables they use five methods: stepwise discriminant analysis, stepwise logistic regression, the t-test, a genetic algorithm and recursive elimination. They conclude that combining financial ratios with corporate governance indicators is better suited to predicting bankruptcy than using financial ratios alone. Their best results are obtained with stepwise discriminant analysis (SDA) and the support vector machine.
These studies use data of differing nature (type of company, explanatory variables, period, etc.), and the divergence in their conclusions about the best-performing models suggests that the most suitable model depends on the data available.
df <- read_csv("C:/Users/33666/Desktop/M2/Projet_stat/data.csv")
# Total number of missing values
sum(is.na(df))
[1] 0
# Total number of duplicated rows
sum(duplicated(df))
[1] 0
dim(df)
[1] 6819 96
names(df)[names(df) == "Bankrupt?"] <- 'bk'
df$bk <- as.factor(df$bk)
df$bk <- relevel(df$bk, ref = 1)
datasummary_skim(df, type = "categorical", output = "kableExtra")
| bk | N | % |
|---|---|---|
| 0 | 6599 | 96.8 |
| 1 | 220 | 3.2 |
pct_format = scales::percent_format(accuracy = .1)
p <- ggplot(df, aes(x = bk, fill = bk)) +
  geom_bar() +
  geom_text(
    aes(
      label = sprintf(
        '%d (%s)',
        ..count..,
        pct_format(..count.. / sum(..count..))
      )
    ),
    stat = 'count',
    nudge_y = .2,
    colour = 'Dark blue',
    size = 5)
p
The exogenous variables were first normalized using min-max feature scaling: \[ F(x_i) = \frac{x_i - \min(x)}{\max(x) - \min(x)} \]
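As a minimal sketch of the min-max scaling formula, the following Python function rescales one column to the interval [0, 1]. The function name `min_max_scale` and the handling of constant columns (mapped to 0) are assumptions made for illustration, not part of the original pipeline.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] with F(x_i) = (x_i - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([2.0, 4.0, 6.0])
print(scaled)  # [0.0, 0.5, 1.0]
```

After this transformation the smallest observed value maps to 0 and the largest to 1, so ratios with very different units become comparable.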
table1 <- psych::describe(df, skew = F, ranges=F)
table1 %>%
kbl(caption = "Recreating booktabs style table") %>%
kable_classic(full_width = F, html_font = "Cambria")
| variable | vars | n | mean | sd | se |
|---|---|---|---|---|---|
| bk* | 1 | 6819 | 1.032263e+00 | 1.767102e-01 | 2.139900e-03 |
| ROA(C) before interest and depreciation before interest | 2 | 6819 | 5.051796e-01 | 6.068560e-02 | 7.349000e-04 |
| ROA(A) before interest and % after tax | 3 | 6819 | 5.586249e-01 | 6.562000e-02 | 7.947000e-04 |
| ROA(B) before interest and depreciation after tax | 4 | 6819 | 5.535887e-01 | 6.159480e-02 | 7.459000e-04 |
| Operating Gross Margin | 5 | 6819 | 6.079480e-01 | 1.693380e-02 | 2.051000e-04 |
| Realized Sales Gross Margin | 6 | 6819 | 6.079295e-01 | 1.691610e-02 | 2.049000e-04 |
| Operating Profit Rate | 7 | 6819 | 9.987551e-01 | 1.301000e-02 | 1.575000e-04 |
| Pre-tax net Interest Rate | 8 | 6819 | 7.971898e-01 | 1.286900e-02 | 1.558000e-04 |
| After-tax net Interest Rate | 9 | 6819 | 8.090836e-01 | 1.360070e-02 | 1.647000e-04 |
| Non-industry income and expenditure/revenue | 10 | 6819 | 3.036229e-01 | 1.116340e-02 | 1.352000e-04 |
| Continuous interest rate (after tax) | 11 | 6819 | 7.813814e-01 | 1.267900e-02 | 1.535000e-04 |
| Operating Expense Rate | 12 | 6819 | 1.995347e+09 | 3.237684e+09 | 3.920795e+07 |
| Research and development expense rate | 13 | 6819 | 1.950427e+09 | 2.598292e+09 | 3.146499e+07 |
| Cash flow rate | 14 | 6819 | 4.674312e-01 | 1.703550e-02 | 2.063000e-04 |
| Interest-bearing debt interest rate | 15 | 6819 | 1.644801e+07 | 1.082750e+08 | 1.311197e+06 |
| Tax rate (A) | 16 | 6819 | 1.150007e-01 | 1.386675e-01 | 1.679200e-03 |
| Net Value Per Share (B) | 17 | 6819 | 1.906606e-01 | 3.338980e-02 | 4.043000e-04 |
| Net Value Per Share (A) | 18 | 6819 | 1.906332e-01 | 3.347350e-02 | 4.054000e-04 |
| Net Value Per Share (C) | 19 | 6819 | 1.906724e-01 | 3.348010e-02 | 4.054000e-04 |
| Persistent EPS in the Last Four Seasons | 20 | 6819 | 2.288129e-01 | 3.326260e-02 | 4.028000e-04 |
| Cash Flow Per Share | 21 | 6819 | 3.234819e-01 | 1.761090e-02 | 2.133000e-04 |
| Revenue Per Share (Yuan ¥) | 22 | 6819 | 1.328641e+06 | 5.170709e+07 | 6.261664e+05 |
| Operating Profit Per Share (Yuan ¥) | 23 | 6819 | 1.090907e-01 | 2.794220e-02 | 3.384000e-04 |
| Per Share Net profit before tax (Yuan) | 24 | 6819 | 1.843606e-01 | 3.318020e-02 | 4.018000e-04 |
| Realized Sales Gross Profit Growth Rate | 25 | 6819 | 2.240790e-02 | 1.207930e-02 | 1.463000e-04 |
| Operating Profit Growth Rate | 26 | 6819 | 8.479800e-01 | 1.075250e-02 | 1.302000e-04 |
| After-tax Net Profit Growth Rate | 27 | 6819 | 6.891461e-01 | 1.385300e-02 | 1.678000e-04 |
| Regular Net Profit Growth Rate | 28 | 6819 | 6.891500e-01 | 1.391030e-02 | 1.685000e-04 |
| Continuous Net Profit Growth Rate | 29 | 6819 | 2.176390e-01 | 1.006300e-02 | 1.219000e-04 |
| Total Asset Growth Rate | 30 | 6819 | 5.508097e+09 | 2.897718e+09 | 3.509100e+07 |
| Net Value Growth Rate | 31 | 6819 | 1.566212e+06 | 1.141594e+08 | 1.382456e+06 |
| Total Asset Return Growth Rate Ratio | 32 | 6819 | 2.642475e-01 | 9.634200e-03 | 1.167000e-04 |
| Cash Reinvestment % | 33 | 6819 | 3.796767e-01 | 2.073660e-02 | 2.511000e-04 |
| Current Ratio | 34 | 6819 | 4.032850e+05 | 3.330216e+07 | 4.032849e+05 |
| Quick Ratio | 35 | 6819 | 8.376595e+06 | 2.446847e+08 | 2.963102e+06 |
| Interest Expense Ratio | 36 | 6819 | 6.309910e-01 | 1.123850e-02 | 1.361000e-04 |
| Total debt/Total net worth | 37 | 6819 | 4.416337e+06 | 1.684069e+08 | 2.039387e+06 |
| Debt ratio % | 38 | 6819 | 1.131771e-01 | 5.392030e-02 | 6.530000e-04 |
| Net worth/Assets | 39 | 6819 | 8.868229e-01 | 5.392030e-02 | 6.530000e-04 |
| Long-term fund suitability ratio (A) | 40 | 6819 | 8.782700e-03 | 2.815290e-02 | 3.409000e-04 |
| Borrowing dependency | 41 | 6819 | 3.746543e-01 | 1.628620e-02 | 1.972000e-04 |
| Contingent liabilities/Net worth | 42 | 6819 | 5.968300e-03 | 1.218840e-02 | 1.476000e-04 |
| Operating profit/Paid-in capital | 43 | 6819 | 1.089767e-01 | 2.778170e-02 | 3.364000e-04 |
| Net profit before tax/Paid-in capital | 44 | 6819 | 1.827150e-01 | 3.078480e-02 | 3.728000e-04 |
| Inventory and accounts receivable/Net value | 45 | 6819 | 4.024593e-01 | 1.332410e-02 | 1.614000e-04 |
| Total Asset Turnover | 46 | 6819 | 1.416056e-01 | 1.011450e-01 | 1.224900e-03 |
| Accounts Receivable Turnover | 47 | 6819 | 1.278971e+07 | 2.782598e+08 | 3.369692e+06 |
| Average Collection Days | 48 | 6819 | 9.826221e+06 | 2.563589e+08 | 3.104474e+06 |
| Inventory Turnover Rate (times) | 49 | 6819 | 2.149106e+09 | 3.247967e+09 | 3.933247e+07 |
| Fixed Assets Turnover Frequency | 50 | 6819 | 1.008596e+09 | 2.477557e+09 | 3.000291e+07 |
| Net Worth Turnover Rate (times) | 51 | 6819 | 3.859510e-02 | 3.668030e-02 | 4.442000e-04 |
| Revenue per person | 52 | 6819 | 2.325854e+06 | 1.366327e+08 | 1.654604e+06 |
| Operating profit per person | 53 | 6819 | 4.006710e-01 | 3.272010e-02 | 3.962000e-04 |
| Allocation rate per person | 54 | 6819 | 1.125579e+07 | 2.945063e+08 | 3.566434e+06 |
| Working Capital to Total Assets | 55 | 6819 | 8.141252e-01 | 5.905440e-02 | 7.151000e-04 |
| Quick Assets/Total Assets | 56 | 6819 | 4.001318e-01 | 2.019981e-01 | 2.446200e-03 |
| Current Assets/Total Assets | 57 | 6819 | 5.222734e-01 | 2.181118e-01 | 2.641300e-03 |
| Cash/Total Assets | 58 | 6819 | 1.240946e-01 | 1.392506e-01 | 1.686300e-03 |
| Quick Assets/Current Liability | 59 | 6819 | 3.592902e+06 | 1.716209e+08 | 2.078308e+06 |
| Cash/Current Liability | 60 | 6819 | 3.715999e+07 | 5.103509e+08 | 6.180286e+06 |
| Current Liability to Assets | 61 | 6819 | 9.067280e-02 | 5.028990e-02 | 6.090000e-04 |
| Operating Funds to Liability | 62 | 6819 | 3.538280e-01 | 3.514720e-02 | 4.256000e-04 |
| Inventory/Working Capital | 63 | 6819 | 2.773951e-01 | 1.046880e-02 | 1.268000e-04 |
| Inventory/Current Liability | 64 | 6819 | 5.580680e+07 | 5.820516e+08 | 7.048571e+06 |
| Current Liabilities/Liability | 65 | 6819 | 7.615989e-01 | 2.066768e-01 | 2.502800e-03 |
| Working Capital/Equity | 66 | 6819 | 7.358165e-01 | 1.167800e-02 | 1.414000e-04 |
| Current Liabilities/Equity | 67 | 6819 | 3.314098e-01 | 1.348800e-02 | 1.633000e-04 |
| Long-term Liability to Current Assets | 68 | 6819 | 5.416004e+07 | 5.702706e+08 | 6.905906e+06 |
| Retained Earnings to Total Assets | 69 | 6819 | 9.347328e-01 | 2.556420e-02 | 3.096000e-04 |
| Total income/Total expense | 70 | 6819 | 2.548900e-03 | 1.209280e-02 | 1.464000e-04 |
| Total expense/Assets | 71 | 6819 | 2.918410e-02 | 2.714880e-02 | 3.288000e-04 |
| Current Asset Turnover Rate | 72 | 6819 | 1.195856e+09 | 2.821161e+09 | 3.416391e+07 |
| Quick Asset Turnover Rate | 73 | 6819 | 2.163735e+09 | 3.374944e+09 | 4.087015e+07 |
| Working capitcal Turnover Rate | 74 | 6819 | 5.940063e-01 | 8.959400e-03 | 1.085000e-04 |
| Cash Turnover Rate | 75 | 6819 | 2.471977e+09 | 2.938623e+09 | 3.558636e+07 |
| Cash Flow to Sales | 76 | 6819 | 6.715308e-01 | 9.341300e-03 | 1.131000e-04 |
| Fixed Assets to Assets | 77 | 6819 | 1.220121e+06 | 1.007542e+08 | 1.220120e+06 |
| Current Liability to Liability | 78 | 6819 | 7.615989e-01 | 2.066768e-01 | 2.502800e-03 |
| Current Liability to Equity | 79 | 6819 | 3.314098e-01 | 1.348800e-02 | 1.633000e-04 |
| Equity to Long-term Liability | 80 | 6819 | 1.156447e-01 | 1.952920e-02 | 2.365000e-04 |
| Cash Flow to Total Assets | 81 | 6819 | 6.497306e-01 | 4.737210e-02 | 5.737000e-04 |
| Cash Flow to Liability | 82 | 6819 | 4.618493e-01 | 2.994270e-02 | 3.626000e-04 |
| CFO to Assets | 83 | 6819 | 5.934151e-01 | 5.856060e-02 | 7.092000e-04 |
| Cash Flow to Equity | 84 | 6819 | 3.155824e-01 | 1.296090e-02 | 1.570000e-04 |
| Current Liability to Current Assets | 85 | 6819 | 3.150640e-02 | 3.084470e-02 | 3.735000e-04 |
| Liability-Assets Flag | 86 | 6819 | 1.173200e-03 | 3.423430e-02 | 4.146000e-04 |
| Net Income to Total Assets | 87 | 6819 | 8.077602e-01 | 4.033220e-02 | 4.884000e-04 |
| Total assets to GNP price | 88 | 6819 | 1.862942e+07 | 3.764501e+08 | 4.558763e+06 |
| No-credit Interval | 89 | 6819 | 6.239146e-01 | 1.228950e-02 | 1.488000e-04 |
| Gross Profit to Sales | 90 | 6819 | 6.079463e-01 | 1.693380e-02 | 2.051000e-04 |
| Net Income to Stockholder's Equity | 91 | 6819 | 8.404021e-01 | 1.452250e-02 | 1.759000e-04 |
| Liability to Equity | 92 | 6819 | 2.803652e-01 | 1.446320e-02 | 1.751000e-04 |
| Degree of Financial Leverage (DFL) | 93 | 6819 | 2.754110e-02 | 1.566790e-02 | 1.897000e-04 |
| Interest Coverage Ratio (Interest expense to EBIT) | 94 | 6819 | 5.653579e-01 | 1.321420e-02 | 1.600000e-04 |
| Net Income Flag | 95 | 6819 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| Equity to Liability | 96 | 6819 | 4.757840e-02 | 5.001370e-02 | 6.057000e-04 |
Looking at the variables in the dataset, we observe that:
- some are variants of others, such as the same ratio computed before and after interest; we keep one of each;
- some variables are duplicated and have the same distribution (e.g. Current Liabilities/Liability and Current Liability to Liability);
- some ratios differ only in their denominator.
include_graphics("GRAPH1.drawio.png")
### Useless and redundant variables
colnames(df)
[1] "bk"
[2] "ROA(C) before interest and depreciation before interest"
[3] "ROA(A) before interest and % after tax"
[4] "ROA(B) before interest and depreciation after tax"
[5] "Operating Gross Margin"
[6] "Realized Sales Gross Margin"
[7] "Operating Profit Rate"
[8] "Pre-tax net Interest Rate"
[9] "After-tax net Interest Rate"
[10] "Non-industry income and expenditure/revenue"
[11] "Continuous interest rate (after tax)"
[12] "Operating Expense Rate"
[13] "Research and development expense rate"
[14] "Cash flow rate"
[15] "Interest-bearing debt interest rate"
[16] "Tax rate (A)"
[17] "Net Value Per Share (B)"
[18] "Net Value Per Share (A)"
[19] "Net Value Per Share (C)"
[20] "Persistent EPS in the Last Four Seasons"
[21] "Cash Flow Per Share"
[22] "Revenue Per Share (Yuan ¥)"
[23] "Operating Profit Per Share (Yuan ¥)"
[24] "Per Share Net profit before tax (Yuan)"
[25] "Realized Sales Gross Profit Growth Rate"
[26] "Operating Profit Growth Rate"
[27] "After-tax Net Profit Growth Rate"
[28] "Regular Net Profit Growth Rate"
[29] "Continuous Net Profit Growth Rate"
[30] "Total Asset Growth Rate"
[31] "Net Value Growth Rate"
[32] "Total Asset Return Growth Rate Ratio"
[33] "Cash Reinvestment %"
[34] "Current Ratio"
[35] "Quick Ratio"
[36] "Interest Expense Ratio"
[37] "Total debt/Total net worth"
[38] "Debt ratio %"
[39] "Net worth/Assets"
[40] "Long-term fund suitability ratio (A)"
[41] "Borrowing dependency"
[42] "Contingent liabilities/Net worth"
[43] "Operating profit/Paid-in capital"
[44] "Net profit before tax/Paid-in capital"
[45] "Inventory and accounts receivable/Net value"
[46] "Total Asset Turnover"
[47] "Accounts Receivable Turnover"
[48] "Average Collection Days"
[49] "Inventory Turnover Rate (times)"
[50] "Fixed Assets Turnover Frequency"
[51] "Net Worth Turnover Rate (times)"
[52] "Revenue per person"
[53] "Operating profit per person"
[54] "Allocation rate per person"
[55] "Working Capital to Total Assets"
[56] "Quick Assets/Total Assets"
[57] "Current Assets/Total Assets"
[58] "Cash/Total Assets"
[59] "Quick Assets/Current Liability"
[60] "Cash/Current Liability"
[61] "Current Liability to Assets"
[62] "Operating Funds to Liability"
[63] "Inventory/Working Capital"
[64] "Inventory/Current Liability"
[65] "Current Liabilities/Liability"
[66] "Working Capital/Equity"
[67] "Current Liabilities/Equity"
[68] "Long-term Liability to Current Assets"
[69] "Retained Earnings to Total Assets"
[70] "Total income/Total expense"
[71] "Total expense/Assets"
[72] "Current Asset Turnover Rate"
[73] "Quick Asset Turnover Rate"
[74] "Working capitcal Turnover Rate"
[75] "Cash Turnover Rate"
[76] "Cash Flow to Sales"
[77] "Fixed Assets to Assets"
[78] "Current Liability to Liability"
[79] "Current Liability to Equity"
[80] "Equity to Long-term Liability"
[81] "Cash Flow to Total Assets"
[82] "Cash Flow to Liability"
[83] "CFO to Assets"
[84] "Cash Flow to Equity"
[85] "Current Liability to Current Assets"
[86] "Liability-Assets Flag"
[87] "Net Income to Total Assets"
[88] "Total assets to GNP price"
[89] "No-credit Interval"
[90] "Gross Profit to Sales"
[91] "Net Income to Stockholder's Equity"
[92] "Liability to Equity"
[93] "Degree of Financial Leverage (DFL)"
[94] "Interest Coverage Ratio (Interest expense to EBIT)"
[95] "Net Income Flag"
[96] "Equity to Liability"
df[ ,c(4,3,8,18,19,23,79,67,78)] <- list(NULL)
predictors <- df[,-c(1)]
rmp1 <- names(predictors)[nearZeroVar(predictors)] # Variables taking a single value or having near-zero variance
print(rmp1)
[1] "Liability-Assets Flag" "Net Income Flag"
table(predictors$`Net Income Flag`)
1
6819
table(predictors$`Liability-Assets Flag`)
0 1
6811 8
df[ ,c("Liability-Assets Flag", "Net Income Flag")] <- list(NULL)
We can already remove the variables Liability-Assets Flag and Net Income Flag. Liability-Assets Flag has a variance very close to 0: it equals 1 for only 8 of the 6,819 observations. Net Income Flag takes a single value, 1, for every observation.
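To make the near-zero-variance criterion concrete, here is a small Python sketch of the rule. The thresholds (`freq_cut = 95/5`, `unique_cut = 10`) reflect our understanding of the defaults of caret's `nearZeroVar` and should be read as assumptions; the function name `near_zero_var` is hypothetical.

```python
from collections import Counter


def near_zero_var(values, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a variable that is constant, or dominated by one value with few distinct values."""
    counts = Counter(values).most_common()
    if len(counts) == 1:
        return True  # zero variance: a single distinct value
    # Ratio of the most frequent value's count to the second most frequent one
    freq_ratio = counts[0][1] / counts[1][1]
    # Share of distinct values among all observations, in percent
    percent_unique = 100.0 * len(counts) / len(values)
    return freq_ratio > freq_cut and percent_unique <= unique_cut


# Counts taken from the two frequency tables shown above
liability_assets_flag = [0] * 6811 + [1] * 8
net_income_flag = [1] * 6819
print(near_zero_var(liability_assets_flag), near_zero_var(net_income_flag))
```

Both flags are caught by this rule: the first because 0 is about 851 times more frequent than 1, the second because it is constant.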
import pandas as pd
df = pd.read_csv("C:/Users/33666/Desktop/M2/Projet_stat/data.csv")
# Drop the target (position 0) and the columns already removed in R (0-based positions)
p = df.drop(columns=df.columns[[0, 3, 2, 7, 17, 18, 22, 78, 66, 94, 85, 77]])
James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning: With Applications in R. Springer; 2013. Following this reference, we treat a variance inflation factor (VIF) above 10 as problematic.
feature VIF
75 Net Income to Total Assets 1.147856e+01
69 Equity to Long-term Liability 1.245814e+01
16 Per Share Net profit before tax (Yuan) 1.974217e+01
13 Persistent EPS in the Last Four Seasons 2.203057e+01
36 Net profit before tax/Paid-in capital 2.311774e+01
67 Cash Flow to Sales 2.649558e+01
37 Inventory and accounts receivable/Net value 3.269785e+01
33 Borrowing dependency 3.694166e+01
58 Working Capital/Equity 3.708076e+01
65 Working capitcal Turnover Rate 8.654667e+01
20 Regular Net Profit Growth Rate 1.325916e+02
19 After-tax Net Profit Growth Rate 1.333995e+02
80 Liability to Equity 1.373998e+02
5 Non-industry income and expenditure/revenue 2.279432e+02
6 Continuous interest rate (after tax) 3.365468e+02
2 Realized Sales Gross Margin 1.074951e+03
3 Operating Profit Rate 1.723033e+03
4 After-tax net Interest Rate 1.815654e+03
78 Gross Profit to Sales 1.218257e+07
1 Operating Gross Margin 4.034741e+07
31 Net worth/Assets 8.148682e+08
30 Debt ratio % 2.160007e+09
49 Current Assets/Total Assets 3.624503e+09
53 Current Liability to Assets 6.458329e+09
47 Working Capital to Total Assets 1.183608e+10
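The VIF values listed above can be understood through the following sketch, which relies on the identity that the VIF of predictor j equals the j-th diagonal entry of the inverse correlation matrix. The helper names (`correlation_matrix`, `invert`, `vif`) and the toy data are illustrative assumptions; this is not the code that produced the table.

```python
from statistics import mean


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    centered = [[v - mean(col) for v in col] for col in columns]
    norms = [sum(v * v for v in c) ** 0.5 for c in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(columns):
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Toy data: x3 is almost a copy of x1, x2 is nearly unrelated to both
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
noise = [0.1, -0.2, 0.1, 0.0, -0.1, 0.2, -0.1, 0.0]
x3 = [a + e for a, e in zip(x1, noise)]
vif_values = vif([x1, x2, x3])
print(vif_values)
```

In this toy example the two nearly identical columns receive a VIF far above 10, while the unrelated column stays close to 1, which mirrors how the threshold is used to flag collinear ratios.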
df[ ,c('Net Income to Total Assets','Equity to Long-term Liability','Per Share Net profit before tax (Yuan)', 'Persistent EPS in the Last Four Seasons', 'Net profit before tax/Paid-in capital', 'Cash Flow to Sales', 'Inventory and accounts receivable/Net value', 'Borrowing dependency', 'Working Capital/Equity', 'Working capitcal Turnover Rate', 'Regular Net Profit Growth Rate', ' fter-tax Net Profit Growth Rate', 'Liability to Equity', 'Non-industry income and expenditure/revenue', 'Continuous interest rate (after tax)', 'Realized Sales Gross Margin', 'Operating Profit Rate', 'After-tax net Interest Rate', 'Gross Profit to Sales', 'Operating Gross Margin', 'Net worth/Assets', 'Debt ratio %', 'Current Assets/Total Assets', 'Current Liability to Assets', ' orking Capital to Total Assets')] <- list(NULL)
library(DiscriMiner) # Provides corRatio (association between a quantitative and a qualitative variable)
# Correlation ratio between each predictor and the bankruptcy indicator
grp <- as.factor(df$bk)
pred <- data.frame(pred = colnames(df)[-1], cor_ratio = NA_real_,
                   stringsAsFactors = FALSE)
for (i in seq_len(nrow(pred))) {
  # Stored as numeric so that sorting and the median below are not done on strings
  pred$cor_ratio[i] <- as.numeric(corRatio(df[[pred$pred[i]]], grp))
}
pred[order(pred$cor_ratio, decreasing = FALSE), ]
lharba <- pred[pred$cor_ratio <= median(pred$cor_ratio), ]
lharba <- as.character(lharba$pred)
print(lharba)
[1] "Research and development expense rate" "Cash flow rate"
[3] "Interest-bearing debt interest rate" "Cash Flow Per Share"
[5] "Operating Profit Growth Rate" "After-tax Net Profit Growth Rate"
[7] "Total Asset Growth Rate" "Net Value Growth Rate"
[9] "Total Asset Return Growth Rate Ratio" "Cash Reinvestment %"
[11] "Quick Ratio" "Total debt/Total net worth"
[13] "Long-term fund suitability ratio (A)" "Contingent liabilities/Net worth"
[15] "Total Asset Turnover" "Fixed Assets Turnover Frequency"
[17] "Net Worth Turnover Rate (times)" "Revenue per person"
[19] "Cash/Current Liability" "Operating Funds to Liability"
[21] "Current Liabilities/Liability" "Current Asset Turnover Rate"
[23] "Quick Asset Turnover Rate" "Cash Turnover Rate"
[25] "Fixed Assets to Assets" "Cash Flow to Total Assets"
[27] "Cash Flow to Liability" "Cash Flow to Equity"
[29] "Total assets to GNP price" "Degree of Financial Leverage (DFL)"
[31] "Equity to Liability"
for (col in lharba)
df[[col]]<-NULL
dim(df)
[1] 6819 31
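The filter above keeps the predictors whose correlation ratio with the bankruptcy indicator lies above the median. As a hedged illustration, the sketch below computes the correlation ratio as the between-group sum of squares divided by the total sum of squares, which is our understanding of what `corRatio` returns; the function name `correlation_ratio` and the toy data are assumptions.

```python
from collections import defaultdict
from statistics import mean


def correlation_ratio(values, groups):
    """Share of the variance of `values` explained by the grouping variable."""
    overall = mean(values)
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    # Between-group sum of squares: group sizes times squared mean deviations
    between = sum(len(vs) * (mean(vs) - overall) ** 2 for vs in by_group.values())
    total = sum((v - overall) ** 2 for v in values)
    return between / total if total > 0 else 0.0


# Toy ratio that separates the two classes well
ratio_values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
bankrupt = [0, 0, 0, 1, 1, 1]
eta_squared = correlation_ratio(ratio_values, bankrupt)
print(round(eta_squared, 3))  # 0.931
```

A value close to 1 means the class means are far apart relative to the overall spread; a value close to 0 means the variable barely distinguishes bankrupt from healthy firms.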
options(scipen = 999)# In order to disable Scientific Notation
myPr<-prcomp(df[,-1],scale=FALSE) # Creating PCs
summary(myPr)
Importance of components:
PC1 PC2 PC3
Standard deviation 3446078327.6927 3026169807.3978 581357119.11620
Proportion of Variance 0.5408 0.4170 0.01539
Cumulative Proportion 0.5408 0.9578 0.97323
PC4 PC5 PC6
Standard deviation 570224898.99792 294387560.07746 278033268.95322
Proportion of Variance 0.01481 0.00395 0.00352
Cumulative Proportion 0.98804 0.99198 0.99550
PC7 PC8 PC9 PC10
Standard deviation 256338088.89731 171588657.56087 49786465.84299 33299783.49166
Proportion of Variance 0.00299 0.00134 0.00011 0.00005
Cumulative Proportion 0.99850 0.99984 0.99995 1.00000
PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18
Standard deviation 0.2274 0.1394 0.103 0.07082 0.04787 0.04083 0.02987 0.02746
Proportion of Variance 0.0000 0.0000 0.000 0.00000 0.00000 0.00000 0.00000 0.00000
Cumulative Proportion 1.0000 1.0000 1.000 1.00000 1.00000 1.00000 1.00000 1.00000
PC19 PC20 PC21 PC22 PC23 PC24 PC25
Standard deviation 0.02585 0.01856 0.01539 0.01408 0.01373 0.01323 0.01226
Proportion of Variance 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
Cumulative Proportion 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
PC26 PC27 PC28 PC29 PC30
Standard deviation 0.01201 0.0112 0.01049 0.01039 0.009755
Proportion of Variance 0.00000 0.0000 0.00000 0.00000 0.000000
Cumulative Proportion 1.00000 1.0000 1.00000 1.00000 1.000000
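The "Proportion of Variance" rows in the summary above follow directly from the component standard deviations: each variance is divided by the total variance. The sketch below reproduces this with the first ten standard deviations from the output; the remaining twenty are all below 1 and are omitted here as negligible, which is a simplifying assumption. The function name `variance_proportions` is hypothetical.

```python
from itertools import accumulate


def variance_proportions(sdev):
    """Return (proportion of variance, cumulative proportion) for each component."""
    variances = [s ** 2 for s in sdev]
    total = sum(variances)
    proportions = [v / total for v in variances]
    return proportions, list(accumulate(proportions))


# Standard deviations of PC1 to PC10 as printed by summary(myPr)
sdev = [3446078327.6927, 3026169807.3978, 581357119.11620, 570224898.99792,
        294387560.07746, 278033268.95322, 256338088.89731, 171588657.56087,
        49786465.84299, 33299783.49166]
proportions, cumulative = variance_proportions(sdev)
print(round(proportions[0], 4), round(cumulative[1], 4))
```

The first two components together account for roughly 96% of the total variance, which is why only PC1 and PC2 are kept in `acp_df`.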
plot(myPr)
Loadings <- myPr$rotation # Extracting PCs for regression
axes <- predict(myPr, newdata = df)
acp_df <- cbind(df[1], axes) # Add the PCs to a new data frame
acp_df[,4:31]<-list(NULL)
include_graphics("GRAPH2.drawio.png")
library(parallel)
library(doParallel)
# control variables
method <- "boot"
numbers <- 30 # Number of bootstrap samples
bTunes <- 30 # tune number of models
seed <- 777
# Seeds
bSeeds <- bootsetSeeds(method = method, numbers = numbers, tunes = bTunes,
seed = seed)
# Configure the trainControl argument for cross-validation
sixStats <- function(...) c(twoClassSummary(...),
defaultSummary(...), mnLogLoss(...))
bCtrl <- trainControl(method = method, number = numbers,
classProbs = TRUE, savePredictions = TRUE,
seeds = bSeeds, summaryFunction = sixStats,
allowParallel = TRUE)
set.seed(1)
inIndex <- createDataPartition(acp_df$bk, p = .3, list = FALSE, times = 1)
acp_train <- acp_df[inIndex,]
acp_test <- acp_df[-inIndex,]
acp_test$bk<-ifelse(acp_test$bk==1,"Yes","No")
acp_train$bk<-ifelse(acp_train$bk==1,"Yes","No")
\(SMOTE\)
library(themis)
bCtrl$sampling <- "smote"
nrcore <- 5
cl <- makeCluster(mc <- getOption("cl.cores", nrcore))
registerDoParallel(cl)
set.seed(777)
acp_logitsmote <- train(bk ~ ., data = acp_train, method = "glm",
trControl = bCtrl)
stopCluster(cl)
acp_logitsmote$results %>%
kable(digits=2) %>%
kable_styling(latex_options = "HOLD_position")
| parameter | ROC | Sens | Spec | Accuracy | Kappa | logLoss | ROCSD | SensSD | SpecSD | AccuracySD | KappaSD | logLossSD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| none | 0.46 | 0.47 | 0.47 | 0.47 | -0.01 | 0.69 | 0.03 | 0.14 | 0.18 | 0.13 | 0.01 | 0 |
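For intuition about the "smote" sampling option used above, the following simplified sketch generates synthetic minority observations by interpolating between a minority point and one of its nearest minority neighbours. It is a toy version written for illustration, not the themis implementation; the function name `smote_sketch`, the choice `k=2` and the example points are assumptions.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=777):
    """Create n_new synthetic points on segments joining minority points to near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, sorted by squared Euclidean distance to the base point
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=6)
print(new_points)
```

Because each synthetic point lies between two existing minority observations, the minority class is enlarged without simply duplicating rows, unlike the "up" option.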
bCtrl$sampling <- "rose"
nrcore <- 5
cl <- makeCluster(mc <- getOption("cl.cores", nrcore))
registerDoParallel(cl)
set.seed(777)
acp_logitrose <- train(bk ~ ., data = acp_train, method = "glm", trControl = bCtrl)
stopCluster(cl)
acp_logitrose$results %>%
kable(digits=2) %>%
kable_styling(latex_options = "striped")
| parameter | ROC | Sens | Spec | Accuracy | Kappa | logLoss | ROCSD | SensSD | SpecSD | AccuracySD | KappaSD | logLossSD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| none | 0.47 | 0.54 | 0.41 | 0.54 | -0.01 | 0.69 | 0.04 | 0.22 | 0.24 | 0.2 | 0.01 | 0.02 |
bCtrl$sampling <- "down"
nrcore <- 5
cl <- makeCluster(mc <- getOption("cl.cores", nrcore))
registerDoParallel(cl)
set.seed(777)
acp_logitdown <- train(bk ~ ., data = acp_train, method = "glm",
trControl = bCtrl)
stopCluster(cl)
acp_logitdown$results %>%
kable(digits=2) %>%
kable_styling(latex_options = "HOLD_position")
| parameter | ROC | Sens | Spec | Accuracy | Kappa | logLoss | ROCSD | SensSD | SpecSD | AccuracySD | KappaSD | logLossSD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| none | 0.49 | 0.52 | 0.43 | 0.52 | -0.01 | 0.7 | 0.04 | 0.16 | 0.2 | 0.15 | 0.01 | 0.01 |
bCtrl$sampling <- "up"
nrcore <- 5
cl <- makeCluster(mc <- getOption("cl.cores", nrcore))
registerDoParallel(cl)
set.seed(777)
acp_logitup <- train(bk ~ ., data = acp_train, method = "glm",
trControl = bCtrl)
stopCluster(cl)
acp_logitup$results %>%
kable(digits=2) %>%
kable_styling(latex_options = "HOLD_position")
| parameter | ROC | Sens | Spec | Accuracy | Kappa | logLoss | ROCSD | SensSD | SpecSD | AccuracySD | KappaSD | logLossSD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| none | 0.46 | 0.54 | 0.39 | 0.53 | -0.01 | 0.69 | 0.03 | 0.14 | 0.17 | 0.13 | 0.01 | 0 |
set.seed(1)
inIndex <- createDataPartition(df$bk, p = .3, list = FALSE, times = 1)
train <- df[inIndex,]
test <- df[-inIndex,]
test$bk<-ifelse(test$bk==1,"Yes","No")
train$bk<-ifelse(train$bk==1,"Yes","No")
library(themis)
bCtrl$sampling <- "smote"
nrcore <- 5
cl <- makeCluster(mc <- getOption("cl.cores", nrcore))
registerDoParallel(cl)
set.seed(777)
logitsmote <- train(bk ~ ., data = train, method = "glm",
trControl = bCtrl)
glm.fit: fitted probabilities numerically 0 or 1 occurred
stopCluster(cl)
logitsmote$results %>%
kable(digits=2) %>%
kable_styling(latex_options = "HOLD_position")
| parameter | ROC | Sens | Spec | Accuracy | Kappa | logLoss | ROCSD | SensSD | SpecSD | AccuracySD | KappaSD | logLossSD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| none | 0.83 | 0.87 | 0.66 | 0.87 | 0.2 | 0.53 | 0.04 | 0.03 | 0.11 | 0.02 | 0.06 | 0.26 |
bCtrl$sampling <- "rose"
nrcore <- 5
cl <- makeCluster(mc <- getOption("cl.cores", nrcore))
registerDoParallel(cl)
set.seed(777)
logitrose <- train(bk ~ ., data = train, method = "glm", trControl = bCtrl)
There were missing values in resampled performance measures.
Something is wrong; all the Accuracy metric values are missing:
ROC Sens Spec Accuracy Kappa logLoss
Min. : NA Min. : NA Min. : NA Min. : NA Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA Median : NA Median : NA Median : NA Median : NA
Mean :NaN Mean :NaN Mean :NaN Mean :NaN Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA Max. : NA Max. : NA Max. : NA Max. : NA
NA's :1 NA's :1 NA's :1 NA's :1 NA's :1 NA's :1
Erreur : Stopping
Pred <- predict(logitsmote, newdata = test)
confuslogsmote1<-caret::confusionMatrix(table(Pred, test$bk), positive="Yes", mode = "everything")
confuslogsmote1
Confusion Matrix and Statistics
Pred No Yes
No 3983 36
Yes 636 118
Accuracy : 0.8592
95% CI : (0.849, 0.869)
No Information Rate : 0.9677
P-Value [Acc > NIR] : 1
Kappa : 0.218
Mcnemar's Test P-Value : <0.0000000000000002
Sensitivity : 0.76623
Specificity : 0.86231
Pos Pred Value : 0.15650
Neg Pred Value : 0.99104
Precision : 0.15650
Recall : 0.76623
F1 : 0.25991
Prevalence : 0.03226
Detection Rate : 0.02472
Detection Prevalence : 0.15797
Balanced Accuracy : 0.81427
'Positive' Class : Yes
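The statistics reported above can be recomputed by hand from the four cells of the confusion matrix, with "Yes" (bankrupt) as the positive class. The sketch below does so in Python; the function name `binary_metrics` is hypothetical and the cell counts are read from the matrix above.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion matrix counts."""
    sensitivity = tp / (tp + fn)   # recall: share of bankrupt firms detected
    specificity = tn / (tn + fp)   # share of healthy firms correctly classified
    precision = tp / (tp + fp)     # share of predicted bankruptcies that are real
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Rows are predictions, columns are true classes in the matrix above
metrics = binary_metrics(tp=118, fp=636, fn=36, tn=3983)
for name, value in metrics.items():
    print(f"{name}: {value:.5f}")
```

The low precision next to a high sensitivity shows the trade-off at work here: most bankruptcies are detected, but many healthy firms are flagged as well, which accuracy alone would hide given that about 97% of firms are healthy.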
Appendix: script used to generate the figures
cat(readLines('Projet.rst'), sep = '\n')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("/content/data (12).csv")
p = df.iloc[: , 1:]
p1 = p.iloc[:,0:24]
p2 = p.iloc[:,24:48]
p3 = p.iloc[:,48:72]
p4 = p.iloc[:,72:]
p.shape ,p1.shape, p2.shape, p3.shape, p4.shape
----------------------------------------------------------------
plt.style.use('ggplot')
ncols1 = 3
nrows1 = int(np.ceil(len(p1.columns) / (1.0*ncols1)))
fig, axes = plt.subplots(nrows=nrows1, ncols=ncols1, figsize=(150, 150))
counter = 0
# One log-scaled histogram per variable; unused panels are hidden
for i in range(nrows1):
    for j in range(ncols1):
        ax = axes[i][j]
        if counter < len(p1.columns):
            ax.hist(p1[p1.columns[counter]], color='Blue', alpha=0.9, label='{}'.format(p1.columns[counter]))
            ax.set_xlabel(str(p1.columns[counter]), fontsize=60)
            ax.set_ylabel('Count', fontsize=10)
            ax.set_yscale('log')
        else:
            ax.set_axis_off()
        counter += 1
fig1 = plt.gcf()
plt.show()
fig1.savefig("PP1.png")
ncols2 = 4
nrows2 = int(np.ceil(len(p2.columns) / (1.0*ncols2)))
fig, axes = plt.subplots(nrows=nrows2, ncols=ncols2, figsize=(100, 100))
counter = 0
for i in range(nrows2):
    for j in range(ncols2):
        ax = axes[i][j]
        if counter < len(p2.columns):
            ax.hist(p2[p2.columns[counter]], color='blue', alpha=1, label='{}'.format(p2.columns[counter]))
            ax.set_xlabel(str(p2.columns[counter]), fontsize=50)
            ax.set_ylabel('Count', fontsize=10)
            ax.set_yscale('log')
        else:
            ax.set_axis_off()
        counter += 1
fig2 = plt.gcf()
plt.show()
fig2.savefig("PP2.png")
ncols3 = 4
nrows3 = int(np.ceil(len(p3.columns) / (1.0*ncols3)))
fig, axes = plt.subplots(nrows=nrows3, ncols=ncols3, figsize=(100, 100))
counter = 0
for i in range(nrows3):
    for j in range(ncols3):
        ax = axes[i][j]
        if counter < len(p3.columns):
            ax.hist(p3[p3.columns[counter]], color='blue', alpha=1, label='{}'.format(p3.columns[counter]))
            ax.set_xlabel(str(p3.columns[counter]), fontsize=50)
            ax.set_ylabel('Count', fontsize=10)
            ax.set_yscale('log')
            leg = ax.legend(loc='upper left')
            leg.draw_frame(False)
        else:
            ax.set_axis_off()
        counter += 1
fig3 = plt.gcf()
plt.show()
fig3.savefig("P3.png")
ncols4 = 4
nrows4 = int(np.ceil(len(p4.columns) / (1.0*ncols4)))
fig, axes = plt.subplots(nrows=nrows4, ncols=ncols4, figsize=(100, 100))
counter = 0
for i in range(nrows4):
    for j in range(ncols4):
        ax = axes[i][j]
        if counter < len(p4.columns):
            ax.hist(p4[p4.columns[counter]], color='blue', alpha=1, label='{}'.format(p4.columns[counter]))
            ax.set_xlabel(str(p4.columns[counter]), fontsize=50)
            ax.set_ylabel('Count', fontsize=10)
            ax.set_yscale('log')
            leg = ax.legend(loc='upper left')
        else:
            ax.set_axis_off()
        counter += 1
fig4 = plt.gcf()
plt.show()
fig4.savefig("P4.png")